
feat(helm): Add S3 storage support for archives, streams, and log inputs.#1825

Merged
junhaoliao merged 14 commits into y-scope:main from junhaoliao:helm-s3
Jan 7, 2026

Conversation

@junhaoliao
Member

@junhaoliao junhaoliao commented Dec 19, 2025

Description

Note

This PR is part of the ongoing work for #1309. More PRs will be submitted until the Helm chart is complete and fully functional.

This PR adds S3/AWS storage support to the CLP Helm chart, enabling archives and stream outputs to be stored in S3 instead of local filesystem storage. The changes include:

New Configuration Options:

  • Add aws_config_directory to values.yaml for specifying the path to AWS config files (e.g., ~/.aws)
  • Support for type: "s3" in archive_output.storage and stream_output.storage
  • Support for type: "s3" in logs_input configuration

New Resources:

  • compression-worker-staged-archives-pv.yaml and compression-worker-staged-archives-pvc.yaml for staging archives before S3 upload
  • query-worker-staged-streams-pv.yaml and query-worker-staged-streams-pvc.yaml for staging streams before S3 upload

Helper Templates:

  • Add clp.awsConfigVolumeMount and clp.awsConfigVolume helpers for mounting AWS configuration
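
A minimal sketch of what these two helpers could look like in _helpers.tpl. The `/.aws` mount path follows CLP's container convention for the AWS config directory; the exact helper bodies here are assumptions, not the merged code:

```yaml
{{/*
Sketch only. Both helpers assume .Values.clpConfig.aws_config_directory
is non-null; callers should guard invocations with an if check.
*/}}
{{- define "clp.awsConfigVolumeMount" -}}
- name: aws-config
  mountPath: /.aws
  readOnly: true
{{- end -}}

{{- define "clp.awsConfigVolume" -}}
- name: aws-config
  hostPath:
    path: {{ .Values.clpConfig.aws_config_directory | quote }}
    type: Directory
{{- end -}}
```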

Deployment Changes:

  • Make logsInputVolume mounts conditional on logs_input.type == "fs"
  • Make shared-data-archives mounts conditional on archive_output.storage.type == "fs"
  • Make shared-data-streams mounts conditional on stream_output.storage.type == "fs"
  • Add aws_config_directory volume mounts (conditional) to compression-scheduler, compression-worker, query-worker, garbage-collector, and webui deployments
  • Add AWS credentials environment variables to webui deployment (when using credentials auth type)
  • Add staged-archives mount to compression-worker (when archive_output.storage.type == "s3")
  • Add staged-streams mount to query-worker (when stream_output.storage.type == "s3")
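
The conditional-mount pattern described in the bullets above can be illustrated roughly as follows (the mount path and volume names are assumptions for the sketch, not necessarily those used in the chart):

```yaml
volumeMounts:
  {{- if eq .Values.clpConfig.archive_output.storage.type "fs" }}
  # fs storage: archives live on the shared-data PVC
  - name: archives
    mountPath: /var/data/archives
  {{- else if eq .Values.clpConfig.archive_output.storage.type "s3" }}
  # s3 storage: archives are staged locally before upload
  - name: staged-archives
    mountPath: {{ .Values.clpConfig.archive_output.storage.staging_directory | quote }}
  {{- end }}
  {{- if .Values.clpConfig.aws_config_directory }}
  {{- include "clp.awsConfigVolumeMount" . | nindent 2 }}
  {{- end }}
```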

ConfigMap Updates:

  • Add S3 configuration support for archive_output.storage, stream_output.storage, and logs_input
  • Add aws_config_directory field to clp-config
  • Update webui-client-settings.json and webui-server-settings.json to support S3 stream files

Checklist

  • The PR satisfies the contribution guidelines.
  • This is a breaking change and it has been indicated in the PR title, OR this isn't a breaking change.
  • Necessary docs have been updated, OR no docs need to be updated.

Validation performed

0. Configure S3 input and outputs in values.yaml:

(Note: aws_authentication.type = profile does not work with the API server and therefore is untested.)

  logs_input:
    type: "s3"
    aws_authentication:
      type: "credentials"
      credentials:
        access_key_id: "xxx"
        secret_access_key: "xxxx"

  archive_output:
    storage:
      type: "s3"
      staging_directory: "/tmp/clp/var/data/staged-archives"
      s3_config:
        region_code: "us-east-2"
        bucket: "xxxxx"
        key_prefix: "/"
        aws_authentication:
          type: "credentials"
          credentials:
            access_key_id: "xxx"
            secret_access_key: "xxxx"

  stream_output:
    storage:
      type: "s3"
      staging_directory: "/tmp/clp/var/data/staged-streams"
      s3_config:
        region_code: "us-east-2"
        bucket: "xxxxx"
        key_prefix: "/"
        aws_authentication:
          type: "credentials"
          credentials:
            access_key_id: "xxx"
            secret_access_key: "xxxx"

1. Helm chart deployment

cd tools/deployment/package-helm
./test.sh

# observed "All jobs completed and services are ready."

2. Configured clp-package/etc/clp-config.yaml

database:
  port: 30306

logs_input:
  type: "s3"
  aws_authentication:
    type: "credentials"
    credentials:
      access_key_id: "xxx"
      secret_access_key: "xxxx"

3. Log ingestion

cd build/clp-package
./sbin/compress-from-s3.sh --timestamp-key=timestamp s3-object https://xxxxx.s3.us-east-2.amazonaws.com/postgresql.jsonl

4. Accessed the WebUI to confirm the logs were ingested

(screenshot)

5. Performed a query in the WebUI with query string "1" and observed results were returned:

(screenshot)

6. Clicked the "Original file" link which redirected to the log viewer with the logs' context restored:

(screenshot)

Summary by CodeRabbit

  • New Features

    • Added S3 storage support for archives, streams and logs with staging options and conditional staged-volume provisioning.
    • Introduced optional AWS config directory support for credential management and optional AWS env vars when needed.
  • Chores

    • Updated chart version to 0.1.2-dev.15.
    • Converted several mounts/volumes to conditional provisioning based on storage settings and consolidated config volume handling.


@coderabbitai
Contributor

coderabbitai bot commented Dec 19, 2025

Warning

Rate limit exceeded

@junhaoliao has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 0 minutes and 8 seconds before requesting another review.


📥 Commits

Reviewing files that changed from the base of the PR and between 2e35de0 and c9dcb52.

📒 Files selected for processing (3)
  • tools/deployment/package-helm/Chart.yaml
  • tools/deployment/package-helm/templates/configmap.yaml
  • tools/deployment/package-helm/values.yaml

Walkthrough

Conditionalizes volume/volumeMount provisioning and adds S3-backed staging paths and optional AWS config directory support across Helm templates; introduces two AWS config helper templates and increments the chart version. ConfigMap values expanded to describe fs and s3 storage options and AWS authentication.

Changes

Cohort / File(s) Summary
Version & Metadata
tools/deployment/package-helm/Chart.yaml
Bumped chart version from 0.1.2-dev.14 to 0.1.2-dev.15.
Template Helpers
tools/deployment/package-helm/templates/_helpers.tpl
Added clp.awsConfigVolumeMount and clp.awsConfigVolume helper definitions (duplicated definitions present).
API / Config volume changes
tools/deployment/package-helm/templates/api-server-deployment.yaml
Made streams mount conditional on clpConfig.stream_output.storage.type == "fs"; introduced a named ConfigMap config volume and corresponding container mount for /etc/clp-config.yaml.
Compression scheduler / worker
tools/deployment/package-helm/templates/compression-scheduler-deployment.yaml, tools/deployment/package-helm/templates/compression-worker-deployment.yaml
Replaced unconditional mounts with conditionals: optional awsConfig mounts if clpConfig.aws_config_directory set; logs_input mounts only when logs_input.type == "fs"; archive staging PVC/hostPath logic branched for fs vs s3.
Staged-archives PV/PVC
tools/deployment/package-helm/templates/compression-worker-staged-archives-pv.yaml, tools/deployment/package-helm/templates/compression-worker-staged-archives-pvc.yaml
Added conditional PV and PVC creation for staged-archives when archive_output.storage.type == "s3".
Garbage collector
tools/deployment/package-helm/templates/garbage-collector-deployment.yaml
Reordered and conditionalized mounts: archives and streams mounts now conditional on storage type; added optional awsConfig mount; created top-level ConfigMap config and tmp volume entries.
Garbage collector logs PV/PVC formatting
tools/deployment/package-helm/templates/garbage-collector-logs-pv.yaml, tools/deployment/package-helm/templates/garbage-collector-logs-pvc.yaml
Reformatted multiline Helm conditionals without semantic change.
Query worker & staged-streams PV/PVC
tools/deployment/package-helm/templates/query-worker-deployment.yaml, tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml, tools/deployment/package-helm/templates/query-worker-staged-streams-pvc.yaml
Added named ConfigMap config volume; conditional streams handling: PVC for fs, staged-streams PV/PVC for s3; optional awsConfig mount when configured.
Shared-data PV/PVC guards
tools/deployment/package-helm/templates/shared-data-archives-pv.yaml, tools/deployment/package-helm/templates/shared-data-archives-pvc.yaml, tools/deployment/package-helm/templates/shared-data-streams-pv.yaml, tools/deployment/package-helm/templates/shared-data-streams-pvc.yaml
Wrapped PV/PVC creation in conditionals so archives and streams shared volumes are created only when their storage types are fs.
Web UI changes
tools/deployment/package-helm/templates/webui-deployment.yaml
Added AWS env vars conditionally when the S3 authentication type is credentials; made awsConfig, logsInput (fs), and streams mounts/volumes conditional on values.
ConfigMap and values
tools/deployment/package-helm/templates/configmap.yaml, tools/deployment/package-helm/values.yaml
Expanded clpConfig schema to support fs and s3 for archive_output, logs_input, stream_output including s3_config and nested aws_authentication; added top-level aws_config_directory (default null).
Tests / misc
tools/deployment/package-helm/test.sh
Expanded test data dir creation to include staged-archives and staged-streams.
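
The test.sh expansion presumably amounts to pre-creating the host directories that the staged hostPath PVs bind to. A sketch, using the staging paths from the validation section above (the actual script may differ):

```shell
# Pre-create host directories for the staged-archives and staged-streams
# hostPath PVs so the kubelet can bind-mount them during deployment.
data_dir="/tmp/clp/var/data"
mkdir -p "${data_dir}/staged-archives" "${data_dir}/staged-streams"
```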

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • kirkrodrigues

Pre-merge checks

✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped: CodeRabbit's high-level summary is enabled.
  • Title Check (✅ Passed): The title accurately and concisely summarizes the main change: adding S3 storage support for archives, streams, and logs input to the Helm chart, which is the central feature across all modified files.
  • Docstring Coverage (✅ Passed): No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.


# Conflicts:
#	tools/deployment/package-helm/Chart.yaml
#	tools/deployment/package-helm/templates/api-server-deployment.yaml
#	tools/deployment/package-helm/templates/configmap.yaml
#	tools/deployment/package-helm/templates/garbage-collector-deployment.yaml
#	tools/deployment/package-helm/templates/garbage-collector-logs-pv.yaml
#	tools/deployment/package-helm/templates/garbage-collector-logs-pvc.yaml
#	tools/deployment/package-helm/test.sh
# Conflicts:
#	tools/deployment/package-helm/Chart.yaml
#	tools/deployment/package-helm/templates/compression-scheduler-deployment.yaml
#	tools/deployment/package-helm/templates/configmap.yaml
#	tools/deployment/package-helm/templates/garbage-collector-deployment.yaml
#	tools/deployment/package-helm/templates/query-worker-deployment.yaml
#	tools/deployment/package-helm/templates/webui-deployment.yaml
@junhaoliao junhaoliao marked this pull request as ready for review January 5, 2026 21:23
@junhaoliao junhaoliao requested a review from a team as a code owner January 5, 2026 21:23

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 7

In @tools/deployment/package-helm/templates/_helpers.tpl:
- Around line 250-260: The helper "clp.awsConfigVolumeMount" assumes
.Values.clpConfig.aws_config_directory is set and will render an invalid mount
if it is null; update the helper's docstring to state the precondition that
.Values.clpConfig.aws_config_directory must be non-null (e.g., "Requires
.Values.clpConfig.aws_config_directory to be set/non-empty; callers must guard
with an if check"), and optionally mention that callers should wrap the helper
with {{- if .Values.clpConfig.aws_config_directory }} ... {{- end }} to avoid
misuse.
- Around line 268-273: The Helm template define "clp.awsConfigVolume" uses a
hostPath with type "Directory" sourced from
.Values.clpConfig.aws_config_directory; update the deployment docs and add a
comment in values.yaml next to clpConfig.aws_config_directory to explicitly
state that when this value is set, the specified directory must already exist on
every target node (or be created during pre-deployment) because Kubernetes will
fail to mount a hostPath Directory that does not exist. Ensure the note mentions
potential alternatives (e.g., using a different volume type) and the exact
values key (.Values.clpConfig.aws_config_directory) so users can find and
configure it easily.
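
A guarded invocation along the lines the reviewer suggests might look like this (the nindent value is illustrative and depends on where the helper is included):

```yaml
{{- if .Values.clpConfig.aws_config_directory }}
{{- include "clp.awsConfigVolume" . | nindent 8 }}
{{- end }}
```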

In @tools/deployment/package-helm/templates/configmap.yaml:
- Around line 286-295: The template accesses .aws_authentication.type without
ensuring .aws_authentication exists; change the conditional to guard for nil,
e.g. use an AND: replace {{- if eq .aws_authentication.type "profile" }} with
{{- if and .aws_authentication (eq .aws_authentication.type "profile") }} so
StreamFilesS3Profile is only read when aws_authentication is present, keeping
the else branch that emits null.
- Line 288: The template currently renders "StreamFilesS3Region" as a JSON
string "null" because {{ .region_code | default "null" | quote }} wraps the
default in quotes; change the template to emit a real JSON null when
.region_code is unset by using a conditional: replace the expression with {{ if
.region_code }}{{ .region_code | quote }}{{ else }}null{{ end }} so
"StreamFilesS3Region" gets either a quoted region string or an unquoted null.

In @tools/deployment/package-helm/templates/garbage-collector-deployment.yaml:
- Around line 73-74: The tmp volume mount incorrectly uses mountPath "/var/log";
update the container volumeMount with name "tmp" to use mountPath "/var/tmp"
(matching other components like compression-worker) to avoid conflicts with the
existing logs mount at "/var/log/garbage_collector".

In @tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml:
- Around line 1-11: The values referenced by the PV templates
(.Values.clpConfig.stream_output.storage.staging_directory and
.Values.clpConfig.archive_output.storage.staging_directory) are missing from
values.yaml causing null paths and invalid PersistentVolume manifests; add these
two keys to values.yaml (or expose them as configurable entries) with sensible
defaults such as "/var/data/staged-streams" for
clpConfig.stream_output.storage.staging_directory and
"/var/data/staged-archives" for
clpConfig.archive_output.storage.staging_directory so the Helm templates that
call include "clp.createLocalPv" receive non-null staging_directory values and
produce valid PV specs.
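
The suggested values.yaml additions would look roughly like the fragment below; the defaults are taken from the comment above, and whether they ship as chart defaults is up to the author:

```yaml
clpConfig:
  archive_output:
    storage:
      staging_directory: "/var/data/staged-archives"
  stream_output:
    storage:
      staging_directory: "/var/data/staged-streams"
```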

In @tools/deployment/package-helm/templates/webui-deployment.yaml:
- Around line 62-70: The template injects AWS_ACCESS_KEY_ID and
AWS_SECRET_ACCESS_KEY directly from .Values.clpConfig.stream_output.storage,
which exposes credentials in plain text; instead, move AWS creds into a
Kubernetes Secret (e.g., a secret named via the same naming helper used
elsewhere like include "clp.fullname") and change the deployment env entries for
AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to use valueFrom.secretKeyRef
referencing that Secret (follow the same pattern used for the DB credentials
block and reference secret keys such as access-key-id and secret-access-key);
remove the direct usage of .s3_config.aws_authentication.credentials.* in the
deployment template and ensure values.yaml contains a credentials.aws section
intended only for secure injection (or require users to supply the Secret
externally).
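
A sketch of the secretKeyRef pattern the reviewer describes. The Secret name and key names below are hypothetical, following the naming-helper convention mentioned:

```yaml
env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: {{ include "clp.fullname" . }}-aws-credentials
        key: access-key-id
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: {{ include "clp.fullname" . }}-aws-credentials
        key: secret-access-key
```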
📜 Review details

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 742f409 and 715163a.

📒 Files selected for processing (21)
  • tools/deployment/package-helm/Chart.yaml
  • tools/deployment/package-helm/templates/_helpers.tpl
  • tools/deployment/package-helm/templates/api-server-deployment.yaml
  • tools/deployment/package-helm/templates/compression-scheduler-deployment.yaml
  • tools/deployment/package-helm/templates/compression-worker-deployment.yaml
  • tools/deployment/package-helm/templates/compression-worker-staged-archives-pv.yaml
  • tools/deployment/package-helm/templates/compression-worker-staged-archives-pvc.yaml
  • tools/deployment/package-helm/templates/configmap.yaml
  • tools/deployment/package-helm/templates/garbage-collector-deployment.yaml
  • tools/deployment/package-helm/templates/garbage-collector-logs-pv.yaml
  • tools/deployment/package-helm/templates/garbage-collector-logs-pvc.yaml
  • tools/deployment/package-helm/templates/query-worker-deployment.yaml
  • tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml
  • tools/deployment/package-helm/templates/query-worker-staged-streams-pvc.yaml
  • tools/deployment/package-helm/templates/shared-data-archives-pv.yaml
  • tools/deployment/package-helm/templates/shared-data-archives-pvc.yaml
  • tools/deployment/package-helm/templates/shared-data-streams-pv.yaml
  • tools/deployment/package-helm/templates/shared-data-streams-pvc.yaml
  • tools/deployment/package-helm/templates/webui-deployment.yaml
  • tools/deployment/package-helm/test.sh
  • tools/deployment/package-helm/values.yaml
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.
📚 Learning: 2025-05-06T10:07:04.654Z
Learnt from: kirkrodrigues
Repo: y-scope/clp PR: 881
File: components/core/tools/scripts/lib_install/macos/install-all.sh:11-12
Timestamp: 2025-05-06T10:07:04.654Z
Learning: In CLP installation scripts, temporary directories with downloaded files should not be automatically cleaned up on failure (e.g., with EXIT traps) to preserve artifacts for debugging purposes.

Applied to files:

  • tools/deployment/package-helm/test.sh
📚 Learning: 2025-12-04T03:31:55.239Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1690
File: tools/deployment/package/docker-compose-all.yaml:424-427
Timestamp: 2025-12-04T03:31:55.239Z
Learning: In tools/deployment/package/docker-compose-all.yaml, the query-worker service writes to /var/data/streams (CLP_STREAM_OUTPUT_DIR_HOST mount), so this directory must remain read-write and should not be mounted with :ro flag.

Applied to files:

  • tools/deployment/package-helm/templates/webui-deployment.yaml
  • tools/deployment/package-helm/templates/shared-data-streams-pvc.yaml
  • tools/deployment/package-helm/templates/compression-worker-deployment.yaml
  • tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml
  • tools/deployment/package-helm/templates/api-server-deployment.yaml
  • tools/deployment/package-helm/templates/query-worker-staged-streams-pvc.yaml
  • tools/deployment/package-helm/templates/query-worker-deployment.yaml
📚 Learning: 2025-07-23T09:54:45.185Z
Learnt from: Bill-hbrhbr
Repo: y-scope/clp PR: 1122
File: components/core/src/clp/clp/CMakeLists.txt:175-195
Timestamp: 2025-07-23T09:54:45.185Z
Learning: In the CLP project, when reviewing CMakeLists.txt changes that introduce new compression library dependencies (BZip2, LibLZMA, LZ4, ZLIB), the team prefers to address conditional linking improvements in separate PRs rather than expanding the scope of focused migration PRs like the LibArchive task-based installation migration.

Applied to files:

  • tools/deployment/package-helm/templates/compression-worker-staged-archives-pvc.yaml
📚 Learning: 2025-10-07T07:54:32.427Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1178
File: components/clp-py-utils/clp_py_utils/clp_config.py:47-47
Timestamp: 2025-10-07T07:54:32.427Z
Learning: In components/clp-py-utils/clp_py_utils/clp_config.py, the CONTAINER_AWS_CONFIG_DIRECTORY constant is intentionally set to pathlib.Path("/") / ".aws" (i.e., `/.aws`) rather than a user-specific home directory. This hardcoded path is part of the container orchestration design.

Applied to files:

  • tools/deployment/package-helm/values.yaml
📚 Learning: 2025-11-28T15:12:53.530Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1655
File: components/clp-package-utils/clp_package_utils/controller.py:183-189
Timestamp: 2025-11-28T15:12:53.530Z
Learning: In the y-scope/clp project, root database credentials (root username and password) are required configuration items for all deployments. The `credentials[ClpDbUserType.ROOT]` entry is guaranteed to exist and can be accessed directly without optional handling.

Applied to files:

  • tools/deployment/package-helm/values.yaml
📚 Learning: 2025-01-13T21:18:54.629Z
Learnt from: haiqi96
Repo: y-scope/clp PR: 662
File: components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py:167-186
Timestamp: 2025-01-13T21:18:54.629Z
Learning: In the CLP system, streams can be deterministically recreated if needed for debugging purposes, so it's acceptable to delete them even if their S3 upload fails.

Applied to files:

  • tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml
  • tools/deployment/package-helm/templates/query-worker-staged-streams-pvc.yaml
📚 Learning: 2025-12-19T05:03:42.629Z
Learnt from: junhaoliao
Repo: y-scope/clp PR: 1818
File: tools/deployment/package-helm/templates/configmap.yaml:12-12
Timestamp: 2025-12-19T05:03:42.629Z
Learning: In the y-scope/clp Helm chart, the `api_server.host` configuration field in the ConfigMap (tools/deployment/package-helm/templates/configmap.yaml) is unused because no other k8s internal services need to reach the API server—it's only accessed from outside the cluster via NodePort.

Applied to files:

  • tools/deployment/package-helm/templates/api-server-deployment.yaml
🪛 YAMLlint (1.37.1)
tools/deployment/package-helm/templates/garbage-collector-logs-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/shared-data-archives-pvc.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 8-8: too many spaces inside braces

(braces)


[error] 9-9: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/shared-data-archives-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 11-11: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/shared-data-streams-pvc.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 8-8: too many spaces inside braces

(braces)


[error] 9-9: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/compression-worker-staged-archives-pvc.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 8-8: too many spaces inside braces

(braces)


[error] 9-9: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/garbage-collector-logs-pvc.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/shared-data-streams-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 11-11: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/garbage-collector-deployment.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)

tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 10-10: too many spaces inside braces

(braces)


[error] 11-11: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/query-worker-staged-streams-pvc.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 8-8: too many spaces inside braces

(braces)


[error] 9-9: too many spaces inside braces

(braces)

tools/deployment/package-helm/templates/compression-worker-staged-archives-pv.yaml

[error] 1-1: syntax error: expected the node content, but found '-'

(syntax)


[error] 10-10: too many spaces inside braces

(braces)


[error] 11-11: too many spaces inside braces

(braces)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: package-image
🔇 Additional comments (26)
tools/deployment/package-helm/Chart.yaml (1)

3-3: Version bump is appropriate for the feature release.

The chart version increment from 0.1.2-dev.14 to 0.1.2-dev.15 correctly reflects the addition of S3 storage support and associated template/configuration enhancements across the deployment manifests.

tools/deployment/package-helm/test.sh (1)

58-59: LGTM! Directory structure supports S3 staging.

The addition of staged-archives and staged-streams directories correctly aligns with the PR's S3 storage support feature, where these staging areas temporarily hold data before upload to S3.

tools/deployment/package-helm/templates/shared-data-archives-pvc.yaml (1)

1-9: LGTM! Conditional PVC creation is correctly implemented.

The conditional guard appropriately creates the shared archives PVC only when using filesystem storage. This aligns with the broader pattern of conditionally mounting volumes based on storage type.

tools/deployment/package-helm/templates/shared-data-archives-pv.yaml (1)

1-11: LGTM! PV conditional aligns with corresponding PVC.

The PV creation is correctly gated on filesystem storage type, ensuring PV and PVC are created together. The hostPath binding from configuration is appropriate for local storage.

tools/deployment/package-helm/templates/query-worker-staged-streams-pvc.yaml (1)

1-9: LGTM! Staging PVC correctly scoped for S3 storage.

The conditional creation of the staging PVC when using S3 storage is appropriate, as streams need to be staged locally before upload. The ReadWriteOnce access mode is correct for single-writer staging scenarios.

Verify that the 50Gi capacity is sufficient for your expected staging workload. Consider making this configurable via values.yaml if different deployments have varying staging requirements.

Based on learnings, streams can be deleted if S3 upload fails, so staging volume sizing is primarily a performance consideration rather than a data durability concern.

tools/deployment/package-helm/templates/garbage-collector-logs-pv.yaml (1)

1-3: LGTM! Formatting improvement enhances readability.

The multi-line conditional format improves readability without changing the logic. This aligns with the formatting style used in other templates in this PR.

tools/deployment/package-helm/templates/shared-data-streams-pv.yaml (1)

1-11: LGTM! Streams PV conditional follows established pattern.

The conditional guard for filesystem storage is consistent with the streams PVC and archives PV approach. Both the PV and PVC use the matching conditional guard (eq .Values.clpConfig.stream_output.storage.type "fs"), and the 20Gi capacity with ReadWriteMany access mode is appropriate for streams, ensuring PV and PVC creation remain synchronized.

tools/deployment/package-helm/templates/shared-data-streams-pvc.yaml (1)

1-9: LGTM! Conditional PVC creation is correct.

The conditional logic appropriately creates the shared-data-streams PVC only when filesystem storage is configured. When S3 storage is used, staged volumes are created instead, making this shared PVC unnecessary.

tools/deployment/package-helm/templates/garbage-collector-logs-pvc.yaml (1)

1-3: LGTM! Formatting improvement enhances readability.

The multi-line conditional format is clearer and consistent with the style used in other templates within this PR.

tools/deployment/package-helm/templates/compression-worker-staged-archives-pv.yaml (1)

1-11: Ensure staging_directory validation for S3 storage.

The PV creation follows the correct pattern. Similar to the staged-streams PV, verify that .Values.clpConfig.archive_output.storage.staging_directory is validated as required when storage.type is "s3", and ensure the directory exists on worker nodes.

Additionally, consider if the 50Gi capacity for both staged PVs (archives and streams) should be independently configurable, as ingestion patterns might differ between archives and streams.

tools/deployment/package-helm/templates/compression-worker-staged-archives-pvc.yaml (1)

1-9: Consider making staging volume capacity configurable.

The PVC logic is correct for S3 storage. However, the 50Gi capacity for the staged-archives volume is hardcoded. Depending on ingestion volume and network speed to S3, users might need to adjust this capacity.

Consider adding a configurable value such as .Values.clpConfig.archive_output.storage.staging_capacity to allow users to tune this based on their workload characteristics.
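
Such a configurable capacity might be wired up as follows; note that staging_capacity is the reviewer's proposal, not an existing chart value:

```yaml
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: {{ .Values.clpConfig.archive_output.storage.staging_capacity | default "50Gi" }}
```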

tools/deployment/package-helm/values.yaml (1)

156-158: LGTM - Clean configuration addition.

The new aws_config_directory configuration is well-documented and defaults to null (disabled), providing safe opt-in behavior for AWS config mounting.

tools/deployment/package-helm/templates/compression-scheduler-deployment.yaml (1)

83-88: LGTM - Conditional volume mounting implemented correctly.

The conditional mounting logic properly guards helper invocations and aligns with the storage-type-based design. AWS config is mounted only when configured, and logs input mount is added only for filesystem-based logs.

Also applies to: 108-113

tools/deployment/package-helm/templates/api-server-deployment.yaml (1)

70-76: LGTM - Streams mount correctly conditionalized.

The streams volume mount is now properly conditional on stream_output.storage.type == "fs", which aligns with the S3 storage support goals. When using S3 storage, the api-server won't need local access to streams.

Also applies to: 100-106

tools/deployment/package-helm/templates/webui-deployment.yaml (1)

83-95: LGTM - Conditional volume mounting implemented correctly.

The conditional mounting logic for AWS config, logs input, and streams is properly implemented with appropriate guards. This correctly enables the webui to work with both filesystem and S3 storage backends.

Also applies to: 114-126

tools/deployment/package-helm/templates/compression-worker-deployment.yaml (2)

62-80: LGTM! Conditional volume mounts correctly handle fs vs s3 storage types.

The conditional logic properly handles:

  • Archives mounted from shared-data PVC for fs type
  • Staged-archives mounted for s3 type (for local staging before S3 upload)
  • AWS config directory mounted when configured
  • Logs input volume mounted only for fs type

103-124: LGTM! Volume definitions align with their corresponding mounts.

The volumes section correctly mirrors the volumeMounts conditionals, ensuring that volumes are only created when their corresponding mounts are enabled.

tools/deployment/package-helm/templates/garbage-collector-deployment.yaml (3)

1-3: YAMLlint false positive - Helm templating syntax is correct.

The static analysis error is a false positive. YAMLlint doesn't understand Helm templating, and {{- if or ... }} is valid Helm syntax for conditional deployment based on retention period settings.


75-91: LGTM! Conditional volume mounts for archives, streams, and AWS config.

The garbage-collector correctly conditionally mounts:

  • Archives volume only when archive_output.storage.type is fs
  • Streams volume only when stream_output.storage.type is fs
  • AWS config when aws_config_directory is set

This aligns with the S3 storage support feature where the garbage-collector would access S3 directly rather than through mounted volumes.


103-124: LGTM! Volume definitions match the conditional mount logic.

tools/deployment/package-helm/templates/query-worker-deployment.yaml (2)

53-79: LGTM! Volume mounts correctly handle fs vs s3 storage types.

The query-worker appropriately:

  • Mounts config as read-only
  • Conditionally mounts archives for fs storage type
  • Mounts streams for fs or staged-streams for s3 storage type

Based on learnings, the streams/staged-streams directories correctly remain read-write (no :ro flag) since the query-worker writes stream data.


97-122: LGTM! Volume definitions properly align with mount conditions.

tools/deployment/package-helm/templates/configmap.yaml (4)

22-57: LGTM! S3 configuration for archive_output storage is well-structured.

The configuration correctly:

  • Supports both fs and s3 storage types
  • Makes endpoint_url and region_code optional (aligned with learnings about custom S3-compatible endpoints)
  • Properly nests aws_authentication with support for profile and credentials types

93-116: LGTM! logs_input S3 configuration follows established pattern.


152-191: LGTM! stream_output storage S3 configuration mirrors archive_output pattern.

Good consistency in the configuration structure across archive_output and stream_output.


193-195: LGTM! Top-level aws_config_directory is correctly conditional.

Comment on lines +250 to +260
{{/*
Creates a volumeMount for the AWS config directory.

@param {object} . Root template context
@return {string} YAML-formatted volumeMount definition
*/}}
{{- define "clp.awsConfigVolumeMount" -}}
name: "aws-config"
mountPath: {{ .Values.clpConfig.aws_config_directory | quote }}
readOnly: true
{{- end }}
Contributor

🧹 Nitpick | 🔵 Trivial

Document the precondition that aws_config_directory must be non-null.

The helper assumes aws_config_directory is set. While all current callers guard with {{- if .Values.clpConfig.aws_config_directory }}, documenting this precondition in the helper's docstring would prevent future misuse.

📝 Suggested documentation enhancement
 {{/*
 Creates a volumeMount for the AWS config directory.
 
+@pre {string} .Values.clpConfig.aws_config_directory must be non-null
 @param {object} . Root template context
 @return {string} YAML-formatted volumeMount definition
 */}}
🤖 Prompt for AI Agents
In @tools/deployment/package-helm/templates/_helpers.tpl around lines 250-260,
The helper "clp.awsConfigVolumeMount" assumes
.Values.clpConfig.aws_config_directory is set and will render an invalid mount
if it is null; update the helper's docstring to state the precondition that
.Values.clpConfig.aws_config_directory must be non-null (e.g., "Requires
.Values.clpConfig.aws_config_directory to be set/non-empty; callers must guard
with an if check"), and optionally mention that callers should wrap the helper
with {{- if .Values.clpConfig.aws_config_directory }} ... {{- end }} to avoid
misuse.

Comment on lines +268 to +273
{{- define "clp.awsConfigVolume" -}}
name: "aws-config"
hostPath:
path: {{ .Values.clpConfig.aws_config_directory | quote }}
type: "Directory"
{{- end }}
Contributor

⚠️ Potential issue | 🟡 Minor


Document the hostPath directory prerequisite for AWS config in deployment guides.

When aws_config_directory is set to a non-null value, Kubernetes requires the directory to exist on all target nodes before pod scheduling. The Helm chart uses hostPath volumes with type: "Directory", which will fail to mount if the path doesn't exist. Add a note to the deployment documentation and/or values.yaml comments stating that users must ensure this directory exists on all worker nodes (or is created as part of pre-deployment setup).

🤖 Prompt for AI Agents
In @tools/deployment/package-helm/templates/_helpers.tpl around lines 268-273,
The Helm template define "clp.awsConfigVolume" uses a hostPath with type
"Directory" sourced from .Values.clpConfig.aws_config_directory; update the
deployment docs and add a comment in values.yaml next to
clpConfig.aws_config_directory to explicitly state that when this value is set,
the specified directory must already exist on every target node (or be created
during pre-deployment) because Kubernetes will fail to mount a hostPath
Directory that does not exist. Ensure the note mentions potential alternatives
(e.g., using a different volume type) and the exact values key
(.Values.clpConfig.aws_config_directory) so users can find and configure it
easily.
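As a concrete sketch of what such a values.yaml note could look like (the key name matches the chart; the comment wording is illustrative, not the chart's actual content):

```yaml
clpConfig:
  # Directory containing AWS config files (e.g., "/home/user/.aws").
  # NOTE: when set, this path is mounted as a hostPath volume with
  # type "Directory", so it must already exist on EVERY node that may
  # schedule pods using it; otherwise those pods fail to start.
  # Leave as null to disable AWS config mounting.
  aws_config_directory: null
```

Alternatively, the volume helper could use `type: "DirectoryOrCreate"`, which lets the kubelet create a missing directory instead of failing the mount, at the cost of silently creating an empty (and therefore useless) config directory.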

Comment on lines +286 to +295
{{- if eq .Values.clpConfig.stream_output.storage.type "s3" }}
{{- with .Values.clpConfig.stream_output.storage.s3_config }}
"StreamFilesS3Region": {{ .region_code | default "null" | quote }},
"StreamFilesS3PathPrefix": {{ printf "%s/%s" .bucket .key_prefix | quote }},
{{- if eq .aws_authentication.type "profile" }}
"StreamFilesS3Profile": {{ .aws_authentication.profile | quote }},
{{- else }}
"StreamFilesS3Profile": null,
{{- end }}
{{- end }}{{/* with .Values.clpConfig.stream_output.storage.s3_config */}}
Contributor

⚠️ Potential issue | 🟡 Minor

Potential nil pointer dereference when accessing aws_authentication.type.

Line 290 accesses .aws_authentication.type without first checking if aws_authentication exists. If aws_authentication is not defined in the values, this will cause a template rendering error.

🔎 Proposed fix to add nil check
      {{- with .Values.clpConfig.stream_output.storage.s3_config }}
      "StreamFilesS3Region": {{ .region_code | default "null" | quote }},
      "StreamFilesS3PathPrefix": {{ printf "%s/%s" .bucket .key_prefix | quote }},
-      {{- if eq .aws_authentication.type "profile" }}
+      {{- if and .aws_authentication (eq .aws_authentication.type "profile") }}
      "StreamFilesS3Profile": {{ .aws_authentication.profile | quote }},
      {{- else }}
      "StreamFilesS3Profile": null,
      {{- end }}
🤖 Prompt for AI Agents
In @tools/deployment/package-helm/templates/configmap.yaml around lines 286-295,
The template accesses .aws_authentication.type without ensuring
.aws_authentication exists; change the conditional to guard for nil, e.g. use an
AND: replace {{- if eq .aws_authentication.type "profile" }} with {{- if and
.aws_authentication (eq .aws_authentication.type "profile") }} so
StreamFilesS3Profile is only read when aws_authentication is present, keeping
the else branch that emits null.

{{ .Values.clpConfig.stream_output.target_uncompressed_size | int }},
{{- if eq .Values.clpConfig.stream_output.storage.type "s3" }}
{{- with .Values.clpConfig.stream_output.storage.s3_config }}
"StreamFilesS3Region": {{ .region_code | default "null" | quote }},
Contributor

⚠️ Potential issue | 🟡 Minor

Consider using unquoted null for proper JSON null value.

{{ .region_code | default "null" | quote }} will produce "null" (string) instead of null (JSON null). If the intent is to have a proper JSON null when region_code is not set, this needs adjustment.

🔎 Proposed fix for proper null handling
-      "StreamFilesS3Region": {{ .region_code | default "null" | quote }},
+      {{- if .region_code }}
+      "StreamFilesS3Region": {{ .region_code | quote }},
+      {{- else }}
+      "StreamFilesS3Region": null,
+      {{- end }}
🤖 Prompt for AI Agents
In @tools/deployment/package-helm/templates/configmap.yaml around line 288, The
template currently renders "StreamFilesS3Region" as a JSON string "null" because
{{ .region_code | default "null" | quote }} wraps the default in quotes; change
the template to emit a real JSON null when .region_code is unset by using a
conditional: replace the expression with {{ if .region_code }}{{ .region_code |
quote }}{{ else }}null{{ end }} so "StreamFilesS3Region" gets either a quoted
region string or an unquoted null.

Comment on lines +73 to +74
- name: "tmp"
mountPath: "/var/log"
Contributor

⚠️ Potential issue | 🟠 Major


Change the tmp volume mount path from /var/log to /var/tmp.

The tmp volume is mounted at /var/log, which is inconsistent with how other components handle temporary volumes (compression-worker mounts tmp to /var/tmp). Additionally, mounting tmp to /var/log is non-standard and could create conflicts with the existing logs volume mount at /var/log/garbage_collector. This appears to be a copy-paste error.

🤖 Prompt for AI Agents
In @tools/deployment/package-helm/templates/garbage-collector-deployment.yaml
around lines 73-74, The tmp volume mount incorrectly uses mountPath "/var/log";
update the container volumeMount with name "tmp" to use mountPath "/var/tmp"
(matching other components like compression-worker) to avoid conflicts with the
existing logs mount at "/var/log/garbage_collector".
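Assuming the volume names quoted in this thread, the corrected mounts would look roughly like:

```yaml
volumeMounts:
  # Scratch space; matches the compression-worker convention.
  - name: "tmp"
    mountPath: "/var/tmp"
  # Component logs keep their own mount, no longer overlapping "tmp".
  - name: "logs"
    mountPath: "/var/log/garbage_collector"
```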

Comment on lines 1 to 11
{{- if eq .Values.clpConfig.stream_output.storage.type "s3" }}
{{- include "clp.createLocalPv" (dict
"root" .
"component_category" "query-worker"
"name" "staged-streams"
"nodeRole" "worker"
"capacity" "50Gi"
"accessModes" (list "ReadWriteOnce")
"hostPath" .Values.clpConfig.stream_output.storage.staging_directory
) }}
{{- end }}
Contributor

⚠️ Potential issue | 🔴 Critical


Add missing staging_directory fields to values.yaml for S3 storage mode.

The PV templates reference .Values.clpConfig.stream_output.storage.staging_directory and .Values.clpConfig.archive_output.storage.staging_directory, but these fields are not defined in values.yaml. When storage.type is set to "s3", Helm renders these as null, producing invalid PersistentVolume manifests that prevent pod scheduling.

Add the following to values.yaml:

  • clpConfig.stream_output.storage.staging_directory: "/var/data/staged-streams"
  • clpConfig.archive_output.storage.staging_directory: "/var/data/staged-archives"

Alternatively, expose these as configurable values to allow customization per deployment.


🤖 Prompt for AI Agents
In @tools/deployment/package-helm/templates/query-worker-staged-streams-pv.yaml
around lines 1-11, The values referenced by the PV templates
(.Values.clpConfig.stream_output.storage.staging_directory and
.Values.clpConfig.archive_output.storage.staging_directory) are missing from
values.yaml causing null paths and invalid PersistentVolume manifests; add these
two keys to values.yaml (or expose them as configurable entries) with sensible
defaults such as "/var/data/staged-streams" for
clpConfig.stream_output.storage.staging_directory and
"/var/data/staged-archives" for
clpConfig.archive_output.storage.staging_directory so the Helm templates that
call include "clp.createLocalPv" receive non-null staging_directory values and
produce valid PV specs.
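A minimal values.yaml addition along these lines would satisfy both PV templates (the paths are the suggested defaults and could instead be exposed as configurable entries):

```yaml
clpConfig:
  archive_output:
    storage:
      # Host path where archives are staged before S3 upload.
      # Must be non-null when storage.type is "s3".
      staging_directory: "/var/data/staged-archives"
  stream_output:
    storage:
      # Host path where streams are staged before S3 upload.
      staging_directory: "/var/data/staged-streams"
```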

Comment on lines +62 to +70
{{- with .Values.clpConfig.stream_output.storage }}
{{- if and (eq .type "s3") (eq .s3_config.aws_authentication.type "credentials") }}
- name: "AWS_ACCESS_KEY_ID"
value: {{ .s3_config.aws_authentication.credentials.access_key_id | quote }}
- name: "AWS_SECRET_ACCESS_KEY"
value: {{ .s3_config.aws_authentication.credentials.secret_access_key | quote }}
{{- end }}{{/* if and (eq .type "s3")
(eq .s3_config.aws_authentication.type "credentials") */}}
{{- end }}{{/* with .Values.clpConfig.stream_output.storage */}}
Contributor

⚠️ Potential issue | 🔴 Critical

Critical: AWS credentials must not be stored in plain text in values.yaml.

The AWS credentials are being read directly from values.yaml and injected as environment variables. This is a critical security vulnerability—credentials should never be stored in plain text in Helm values or ConfigMaps.

Required fix: Store AWS credentials in a Kubernetes Secret and reference them using valueFrom.secretKeyRef, following the same pattern used for database credentials (lines 42-51).

🔒 Recommended fix using Kubernetes Secrets
  1. Create a Secret for AWS credentials (in a new file or existing secrets file):
apiVersion: v1
kind: Secret
metadata:
  name: {{ include "clp.fullname" . }}-aws-credentials
type: Opaque
data:
  access-key-id: {{ .Values.credentials.aws.access_key_id | b64enc | quote }}
  secret-access-key: {{ .Values.credentials.aws.secret_access_key | b64enc | quote }}
  1. Update the deployment to reference the Secret:
 {{- with .Values.clpConfig.stream_output.storage }}
 {{- if and (eq .type "s3") (eq .s3_config.aws_authentication.type "credentials") }}
 - name: "AWS_ACCESS_KEY_ID"
-  value: {{ .s3_config.aws_authentication.credentials.access_key_id | quote }}
+  valueFrom:
+    secretKeyRef:
+      name: {{ include "clp.fullname" $ }}-aws-credentials
+      key: "access-key-id"
 - name: "AWS_SECRET_ACCESS_KEY"
-  value: {{ .s3_config.aws_authentication.credentials.secret_access_key | quote }}
+  valueFrom:
+    secretKeyRef:
+      name: {{ include "clp.fullname" $ }}-aws-credentials
+      key: "secret-access-key"
 {{- end }}
 {{- end }}
  1. Move credentials to a dedicated section in values.yaml:
credentials:
  aws:
    access_key_id: ""  # Set via --set or secure values file
    secret_access_key: ""

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In @tools/deployment/package-helm/templates/webui-deployment.yaml around lines
62-70, The template injects AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY directly
from .Values.clpConfig.stream_output.storage, which exposes credentials in plain
text; instead, move AWS creds into a Kubernetes Secret (e.g., a secret named via
the same naming helper used elsewhere like include "clp.fullname") and change
the deployment env entries for AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to
use valueFrom.secretKeyRef referencing that Secret (follow the same pattern used
for the DB credentials block and reference secret keys such as access-key-id and
secret-access-key); remove the direct usage of
.s3_config.aws_authentication.credentials.* in the deployment template and
ensure values.yaml contains a credentials.aws section intended only for secure
injection (or require users to supply the Secret externally).

Contributor

@gibber9809 gibber9809 left a comment

Did a first pass with a bunch of questions/comments, but overall the PR looks good. As far as I can tell, some of the rabbit comments seem worth addressing as well.

"component_category" "compression-worker"
"name" "staged-archives"
"nodeRole" "worker"
"capacity" "50Gi"
Contributor

Not a huge deal, but this seems a bit high for staging.

Maybe something more like 10Gi (to match up with what we seem to provision for tmp space) might be more suitable, or 20-25Gi if we want a bit more headroom?

Member Author

let's use 20Gi for both the staged archives and streams for now
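With that change, the staged-archives PV template would read roughly as follows (a sketch mirroring the staged-streams template quoted later in this thread, with only the capacity changed):

```yaml
{{- if eq .Values.clpConfig.archive_output.storage.type "s3" }}
{{- include "clp.createLocalPv" (dict
    "root" .
    "component_category" "compression-worker"
    "name" "staged-archives"
    "nodeRole" "worker"
    "capacity" "20Gi"
    "accessModes" (list "ReadWriteOnce")
    "hostPath" .Values.clpConfig.archive_output.storage.staging_directory
) }}
{{- end }}
```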

"root" .
"component_category" "compression-worker"
"name" "staged-archives"
"capacity" "50Gi"
Contributor

Ditto.

Comment on lines +106 to +113
{{- if .credentials }}
credentials:
access_key_id: {{ .credentials.access_key_id | quote }}
secret_access_key: {{ .credentials.secret_access_key | quote }}
{{- if .credentials.session_token }}
session_token: {{ .credentials.session_token | quote }}
{{- end }}
{{- end }}
Contributor

My understanding is that credentials and other privileged data should be put in a "Secret" instead of a "ConfigMap" for Kubernetes. I guess this is pretty much the same way we currently treat credentials access, so it's probably fine for now, but it might be worth changing in the future. If we don't have an issue tracking this, it might be worth creating one.

Member Author

Right, the plan is to isolate credentials from other clp-config fields so that their lifecycle can be managed more securely. Once this separation is in place, moving those values into Kubernetes Secrets would be more meaningful and appropriate

@coderabbitai in case we haven't created an issue for such refactoring, create one to track

Contributor


@junhaoliao, I've created issue #1844 to track the refactoring of credentials from ConfigMaps to Kubernetes Secrets. The issue includes the proposed approach and references this PR and discussion thread.

🐰📝


🧠 Learnings used
Learnt from: hoophalab
Repo: y-scope/clp PR: 1767
File: components/clp-py-utils/clp_py_utils/clp_config.py:580-585
Timestamp: 2025-12-10T23:32:24.363Z
Learning: In PR #1767, custom S3 endpoint support was added to CLP. The S3Config.region_code field is now optional (NonEmptyStr | None) because custom S3-compatible endpoints (MinIO, LocalStack, etc.) use path-style URLs and don't require AWS region codes. Only AWS S3 endpoints require region_code. Presto integration still requires region_code because it only works with AWS S3.

{{- if eq .Values.clpConfig.logs_input.type "fs" }}
"LogsInputRootDir": "/mnt/logs",
{{- else }}
"LogsInputRootDir": null,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why does LogsInputRootDir and StreamFilesDir, StreamFilesS3Region , etc.. Need to be explicitly null when not in use?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In JavaScript/TypeScript, null and undefined are distinct values with different types. Consumers of settings.json (the web UI code) expect each value to be either null or a string. While we could also tolerate undefined when a field is missing, this would require additional handling and stricter typings.

"component_category" "query-worker"
"name" "staged-streams"
"nodeRole" "worker"
"capacity" "50Gi"
Contributor

Pretty much same as my other comment. Arguably we might need more space for staged streams compared to staged archives, but maybe 25Gi could be enough as a default?

Member Author

lowered to 20Gi

"root" .
"component_category" "query-worker"
"name" "staged-streams"
"capacity" "50Gi"
Contributor

Ditto.

Member Author:

Lowered to 20Gi.

Comment on lines +64 to +67
- name: "AWS_ACCESS_KEY_ID"
value: {{ .s3_config.aws_authentication.credentials.access_key_id | quote }}
- name: "AWS_SECRET_ACCESS_KEY"
value: {{ .s3_config.aws_authentication.credentials.secret_access_key | quote }}
Contributor:

Does this not support session token?

Member Author:

Right, such support is missing in the webui. That said, we officially only support long-term credentials in the config interfaces: https://docs.yscope.com/clp/main/user-docs/guides-using-object-storage/clp-config.html#credentials

Member Author:

I understand that writing this configmap.yaml might be confusing:

{{- if .credentials.session_token }}
session_token: {{ .credentials.session_token | quote }}
{{- end }}

The rationale is that if people really only have temporary credentials and want to use them, they still can do so with the supported services.

If you think that's too confusing, I can remove it.

Contributor:

No, I think it makes sense to keep for our own testing purposes. Since we don't document it for end users I don't think it should cause any issues.
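For concreteness, the conditional above renders roughly as follows (the surrounding keys and all values are hypothetical, not copied from the chart):

```yaml
# With session_token set in values.yaml:
credentials:
  access_key_id: "AKIAEXAMPLE"
  secret_access_key: "example-secret"
  session_token: "example-token"
---
# With session_token unset, the key is omitted entirely:
credentials:
  access_key_id: "AKIAEXAMPLE"
  secret_access_key: "example-secret"
```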

gibber9809 previously approved these changes Jan 7, 2026
Contributor @gibber9809 left a comment:


Everything looks good to me now.

Maybe for the PR title:

feat(helm): Add S3 storage support for archives, streams, and log inputs.

@junhaoliao junhaoliao changed the title feat(helm): Add S3 storage support for archives, streams, and logs input. feat(helm): Add S3 storage support for archives, streams, and log inputs. Jan 7, 2026
@junhaoliao junhaoliao requested a review from gibber9809 January 7, 2026 22:25
@junhaoliao junhaoliao merged commit 877d0fc into y-scope:main Jan 7, 2026
21 checks passed
@junhaoliao junhaoliao deleted the helm-s3 branch January 7, 2026 22:27
davidlion pushed a commit to davidlion/clp that referenced this pull request Jan 17, 2026